Draft ProForma implementation #37

Merged Jun 28, 2021 (33 commits)


mobiusklein
Contributor

This is a draft implementation for reading and writing the ProForma notation for modified amino acid sequences. I worked to avoid adding dependencies here: additional controlled vocabularies are optional unless you try to parse a string that uses them, at which point they are loaded lazily from psims.
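
The lazy-loading idea described above can be illustrated with a minimal sketch. Everything in it (the `LazyVocabulary` class, the `load_toy_vocabulary` loader, the `load_count` counter) is hypothetical and is not taken from the PR; it only shows the general pattern of deferring an expensive load until a term from that vocabulary is actually requested.

```python
class LazyVocabulary:
    """Hypothetical holder that loads a vocabulary only when first used."""

    def __init__(self, name, loader):
        self.name = name
        self._loader = loader   # zero-argument callable returning a mapping
        self._database = None   # nothing is loaded at construction time
        self.load_count = 0     # tracked only to make the laziness visible

    @property
    def database(self):
        # Load on first access, then reuse the cached mapping.
        if self._database is None:
            self._database = self._loader()
            self.load_count += 1
        return self._database

    def resolve(self, term):
        return self.database[term]


def load_toy_vocabulary():
    # Stand-in for an expensive download or file parse (assumption).
    return {"Oxidation": 15.994915}


toy = LazyVocabulary("toy", load_toy_vocabulary)
loads_before = toy.load_count      # no load has happened yet
mass = toy.resolve("Oxidation")    # first use triggers the load
toy.resolve("Oxidation")           # second use hits the cached mapping
loads_after = toy.load_count
```

With this shape, a user who never writes a tag from an optional vocabulary never pays for loading it, and never needs the optional dependency installed.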

It still needs more documentation (especially about how to interact with some of its implementation details and which feature annexes it supports) and tests. I can likely inherit several of those from https://github.com/topdownproteomics/sdk/blob/master/tests/TopDownProteomics.Tests/ProForma/ProFormaParserTests.cs.

The ProForma specification is going through review now, but there's already discussion of an update to allow multiple modifications at a single position.

@mobiusklein mobiusklein marked this pull request as ready for review May 24, 2021 03:13
@mobiusklein
Contributor Author

@levitsky This should finally be ready for review, with the fundamental functionality all in place.

This adds the parse_proforma function, which parses a string in ProForma 2.0 format into a list of peptide position tokens and a dictionary of additional modification information (unlocalized, ambiguous or labile modifications, global modification rules, and so on), and a to_proforma function, which takes that information and turns it back into a ProForma 2.0 string. It also includes a ProForma class, which layers on a little more behavior such as mass calculation, slicing, and searching for tags by ID.
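
To make the parse/format round trip concrete, here is a toy sketch covering only a tiny subset of the notation (single-letter residues followed by bracketed tags). It is not the PR's tokenizer: `parse_toy` and `format_toy` are hypothetical names, the `(residue, tags)` tuple layout is an assumption, and terminal, labile, global, and nested-bracket cases are deliberately ignored.

```python
def parse_toy(sequence):
    """Split a string like 'EM[Oxidation]K' into (residue, tags) tokens."""
    tokens = []
    i = 0
    while i < len(sequence):
        ch = sequence[i]
        if ch == "[":
            # A tag attaches to the residue immediately before it.
            end = sequence.find("]", i)
            if end == -1 or not tokens:
                raise ValueError("unbalanced or misplaced tag at index %d" % i)
            residue, tags = tokens[-1]
            tokens[-1] = (residue, (tags or []) + [sequence[i + 1:end]])
            i = end + 1
        elif ch.isalpha() and ch.isupper():
            tokens.append((ch, None))
            i += 1
        else:
            raise ValueError("unexpected character %r at index %d" % (ch, i))
    return tokens


def format_toy(tokens):
    """Inverse of parse_toy for the same restricted subset."""
    parts = []
    for residue, tags in tokens:
        parts.append(residue)
        for tag in tags or []:
            parts.append("[%s]" % tag)
    return "".join(parts)


text = "EM[Oxidation]EVEES[Phospho]PEK"
tokens = parse_toy(text)
round_trip = format_toy(tokens)
```

Keeping one token per sequence position is what makes position-based operations such as slicing straightforward in a sketch like this.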

The non-user-facing bits include all the baroque machinery for dealing with six different modification vocabularies, a more forgiving tokenizer, and a slightly borrowed test suite.

There is still some documentation to iron out, especially which "implementation level" this counts as, since it implements everything except inter-peptide cross-linking support. Users also need to be told how to control the way additional controlled vocabularies are loaded. Right now it uses Unimod directly from pyteomics.mass.Unimod, but tries to import psims to load the rest, emitting an error message if it needs one of those databases and psims isn't installed.

Owner

@levitsky levitsky left a comment

This is huge, thank you!
I left some questions/comments in the code. I don't have any use cases, but I was able to catch a couple of issues by trying to parse examples from the ProForma README. We can try copying those into tests and adding a psims install to the GitHub Actions workflow.
Otherwise, my only real concern is formula parsing. I left a comment about it in the code, too.
Thank you once again for the awesome work.

pyteomics/proforma.py (outdated)
from pyteomics.auxiliary import PyteomicsError, BasicComposition
from pyteomics.auxiliary.utils import add_metaclass

# To eventually be implemented with pyteomics port?
Owner

Is there anything you don't like about this dependency?

pyteomics/proforma.py
load_psimod = partial(_needs_psims, 'PSIMOD')
load_xlmod = partial(_needs_psims, 'XLMOD')
load_gno = partial(_needs_psims, 'GNO')
obo_cache = None
Owner

This name does not seem to be used, is it necessary?

Contributor Author

The name gets baked into the partial-made function so that the error is clear about which "entity" depended upon the other source, e.g.

>>> load_psimod()
ImportError: Loading PSIMOD requires the `psims` library. To access it, please install `psims`

Technically, we could just make the message "Loading this controlled vocabulary requires psims." and be done, but it feels less explicit.
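
For readers unfamiliar with the pattern, here is a minimal sketch of how such fallback loaders can be built with `functools.partial`. The body of `_needs_psims` is an assumption reconstructed from the error message quoted above, and only the fallback branch (the case where psims is not installed) is shown.

```python
from functools import partial


def _needs_psims(name):
    # Assumed body: raise with the vocabulary name baked into the message.
    raise ImportError(
        "Loading %s requires the `psims` library. "
        "To access it, please install `psims`" % name)


# Fallback loaders, as they would be defined when importing psims fails.
load_psimod = partial(_needs_psims, 'PSIMOD')
load_xlmod = partial(_needs_psims, 'XLMOD')
load_gno = partial(_needs_psims, 'GNO')

try:
    load_xlmod()
except ImportError as err:
    message = str(err)  # names XLMOD, not a generic "this vocabulary"
```

Each partial carries its own vocabulary name, so one helper serves every loader while the error still says which vocabulary triggered it.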

Owner

@levitsky levitsky Jun 1, 2021

To be clear, this comment of mine referred to obo_cache only. It is not used in pyteomics code. GitHub displays four lines for context, and the last one is the one I commented on.

Contributor Author

Thank you for clarifying. Yes, that variable is imported so that proforma can expose control over the default file cache from psims. That way users could set the cache directory or disable the cache altogether if they wished. Otherwise they'd need to explicitly import it from psims. It probably doesn't need to be imported here; it just needs to be documented clearly that proforma interacts with the psims cache mechanism.

Owner

Thank you! Indeed, I think it's worth mentioning in the docs because the user may not even know that psims is used, or how it works with caches.

@mobiusklein
Contributor Author

Thank you for catching those leftover items.

Compliance levels:

  1. Base Level Support
     Represents the lowest level of compliance, this level involves providing support for:
     • Amino acid sequences
     • Protein modifications using two of the supported CVs/ontologies: Unimod and PSI-MOD.
     • Protein modifications using delta masses (without prefixes)
     • N-terminal, C-terminal and labile modifications.
     • Ambiguity in the modification position, including support for localisation scores.
     • INFO tag.
  2. Additional Separate Support
     These features are independent from each other:
     • Unusual amino acids (O and U).
     • Ambiguous amino acids (e.g. X, B, Z). This would include support for sequence tags of known mass (using the character X).
     • Protein modifications using delta masses (using prefixes for the different CVs/ontologies).
     • Use of prefixes for Unimod (U:) and PSI-MOD (M:) names.
     • Support for the joint representation of experimental data and its interpretation.
  3. Top Down Extensions
     • Additional CV/ontologies for protein modifications: RESID (the prefix R MUST be used for RESID CV/ontology term names)
     • Chemical formulas (this feature occurs in two places in this list).
  4. Cross-Linking Extensions
     • Cross-linked peptides (using the XL-MOD CV/ontology, the prefix X MUST be used for XL-MOD CV/ontology term names).
  5. Glycan Extensions
     • Additional CV/ontologies for protein modifications: GNO (the prefix G MUST be used for GNO CV/ontology term names)
     • Glycan composition.
     • Chemical formulas (this feature occurs in two places in this list).
  6. Spectral Support
     • Charge and chimeric spectra are special cases (see Appendix II).
     • Global modifications (e.g., every C is C13).
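
The name prefixes mentioned in the list above (U, M, R, X, G) lend themselves to a small dispatch table. The sketch below is only an illustration of that idea: `CV_PREFIXES` and `split_cv_prefix` are hypothetical names, not part of the PR, and accession-style tags or delta masses are simply passed through untouched.

```python
# Prefix letters as listed in the compliance levels above.
CV_PREFIXES = {
    "U": "Unimod",
    "M": "PSI-MOD",
    "R": "RESID",
    "X": "XL-MOD",
    "G": "GNO",
}


def split_cv_prefix(tag):
    """Return (vocabulary, name) for a prefixed name, else (None, tag)."""
    prefix, sep, rest = tag.partition(":")
    if sep and prefix in CV_PREFIXES:
        return CV_PREFIXES[prefix], rest
    # No recognised single-letter prefix: leave the tag for other handlers.
    return None, tag


prefixed = split_cv_prefix("U:Oxidation")
unprefixed = split_cv_prefix("Oxidation")
glycan = split_cv_prefix("G:G59626AS")
```

A tag without a prefix comes back with `None` as its vocabulary, which leaves room for a caller to try each base-level vocabulary in turn.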

Owner

@levitsky levitsky left a comment

Thank you for solving the isotope parsing issue and laying out the supported features, this is very helpful. (Also for all the smaller edits.)
I noticed another couple of minor issues with the help of pyflakes.
Other than that, one question/suggestion: would you agree to naming the entry-level function just parse, for the sake of brevity and consistency (see parser.parse and all the read functions)?

pyteomics/proforma.py

def __getitem__(self, i):
    if isinstance(i, slice):
        props = self.properties.copy()
Owner

Suggested change
props = self.properties.copy()

@levitsky
Owner

At this point I'm more than happy with the state of this PR. Please let me know if/when you think it's ready to merge.

@mobiusklein
Contributor Author

Thank you.

There's another ProForma meeting tomorrow which may or may not introduce more changes. The ambiguous sequence region feature was a late addition. We'll see if more work is needed or if there are any comments from the group.

@mobiusklein
Contributor Author

I've updated the documentation on psims to discuss the caching mechanism in a bit more detail. No new features have been added to ProForma since the last meeting, and likely the best way to get more feedback at this point is for people to try to use it. If you're satisfied with the level of documentation within the module itself, we can merge it.

@levitsky levitsky merged commit 4cee0bb into levitsky:master Jun 28, 2021